WIP: Add model merge example #5741

Draft
wants to merge 17 commits into
base: master

Conversation

ngxson
Collaborator

@ngxson ngxson commented Feb 26, 2024

I don't know if it's a good idea or not.

Still WIP and untested; it would be nice if someone could test it out.

usage: ./merge ./path/model_1 CONFIG1 ./path/model_2 CONFIG2 ./path/output

  CONFIG must be in format: p0-p1,p2-p3,p4,... Example: 0-5,7,8-12
  Optionally, you can specify the scaling for a range of layers, for example: 0-5*0.5,6-7*1. By default, the scale is 0.5. Layer numbering starts from 0.
  The embedding layer of the first model will be used
  NOTE: currently, only F16 model type is supported
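
The CONFIG syntax described above can be read as a small parser. The following is only a sketch in Python, not the PR's C++ code: the function name `parse_config` is hypothetical, and it assumes ranges are inclusive (which matches the later correction of `0-5,7,8-12` to `0-6,7,8-12`) and that layers without an explicit `*scale` get the default 0.5.

```python
def parse_config(config: str, default_scale: float = 0.5) -> dict[int, float]:
    """Expand e.g. '0-5*0.5,6-7*1' into {layer_index: scale}."""
    scales: dict[int, float] = {}
    for part in config.split(","):
        part = part.strip()
        if not part:
            continue
        # Optional '*scale' suffix; otherwise use the default scale.
        if "*" in part:
            layers, scale_text = part.split("*", 1)
            scale = float(scale_text)
        else:
            layers, scale = part, default_scale
        # 'p0-p1' is treated as an inclusive range; a bare 'p' is one layer.
        if "-" in layers:
            start_text, end_text = layers.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(layers)
        if end < start:
            raise ValueError(f"bad range: {part!r}")
        for layer in range(start, end + 1):
            scales[layer] = scale
    return scales


print(parse_config("0-5*0.5,6-7*1"))
```

With this reading, `0-5,7,8-12` leaves layer 6 without any entry, which is the gap pointed out later in the thread.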

@ngxson ngxson added the help wanted Extra attention is needed label Feb 26, 2024
@ngxson ngxson changed the title Add model merge example WIP: Add model merge example Feb 26, 2024
@sorasoras

#4718 (comment)

For this PR, I think that in addition to merging two models, it should also add a feature to evaluate a single layer multiple times.
Just reconfigure the same gguf.

@ngxson
Collaborator Author

ngxson commented Feb 27, 2024

@sorasoras Yeah, I think I'll try that next. For the moment, I haven't been able to test this PR yet. Also, I planned to start by simply processing layer by layer, so that I don't modify any offset (and thus make no changes to the metadata).

The function that you mentioned requires changing metadata, which I haven't had time to look into yet. But it's definitely something I'll try in the future.

@sorasoras

sorasoras commented Feb 29, 2024

> @sorasoras Yeah, I think I'll try that next. For the moment, I haven't been able to test this PR yet. Also, I planned to start by simply processing layer by layer, so that I don't modify any offset (and thus make no changes to the metadata).
>
> The function that you mentioned requires changing metadata, which I haven't had time to look into yet. But it's definitely something I'll try in the future.

That's fair, but I was thinking that changing metadata is easier to implement and test on existing models.
It's harder to know what works or not when frankenmerging different models.
Anyway, thanks for the hard work.

@dnhkng

dnhkng commented Feb 29, 2024

I would be interested in layer interleaving. Is this only for merging layers' weights linearly? Or can it do passthrough?

Also this line is not entirely clear:
CONFIG must be in format: p0-p1,p2-p3,p4,... Example: 0-5,7,8-12
It looks sequential, and only one config is given, so it's not clear what the second model's config should look like.
If one model has: 0-5,7,8-12, what should the config of the other model be? The gaps?

Most frankenmerges for passthrough are done like so:

dtype: float16
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 20]
    model: 152334H/miqu-1-70b-sf
- sources:
  - layer_range: [10, 30]
    model: 152334H/miqu-1-70b-sf
- sources:
  - layer_range: [20, 40]
    model: 152334H/miqu-1-70b-sf
...

Can this kind of block repetition be done with this code?

@ngxson
Collaborator Author

ngxson commented Feb 29, 2024

@dnhkng Yeah, in fact I have a typo in 0-5,7,8-12; it should be 0-6,7,8-12.

This PR only aims to merge the weights linearly, meaning it does not add or remove any layers in the merged model.

One thing I don't understand in the lazy merge kit format, though; can you please clarify it? Does the interleaving mean some layers are repeated (for example, [0-20] + [10-30] results in [0-10] + [10-20] + [10-20] + [20-30])?

Thank you in advance.

@ngxson
Collaborator Author

ngxson commented Feb 29, 2024

> Yeah, in fact I have a typo in 0-5,7,8-12; it should be 0-6,7,8-12.

It's true that the logic of my CONFIG argument is not correct. In fact, it should always be used with the "scale". For example, if I want to take 0-7 from model A and 8-12 from model B:

CONFIG1 = 0-7*1,8-12*0
CONFIG2 = 0-7*0,8-12*1

But I'm planning to redesign the whole thing, to prepare support for the "repeated layers" option.

@dnhkng

dnhkng commented Feb 29, 2024

dtype: float16
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 10]
    model: 152334H/miqu-1-70b-sf
- sources:
  - layer_range: [5, 15]
    model: 152334H/miqu-1-70b-sf
- sources:
  - layer_range: [10, 20]
    model: 152334H/miqu-1-70b-sf
...

This would result in:
0,1,2,3,4,5,6,7,8,9,5,6,7,8,9,10,11,12,13,14,10,11,12,13,14,15,16,17,18,19...

This is why Frankenmerge models are larger than base models.
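
As a rough illustration of the expansion above, here is a small Python sketch. The helper name `expand_slices` is hypothetical, and it assumes each `layer_range` is end-exclusive, which is what the listed result (0..9, then 5..14, then 10..19) implies.

```python
def expand_slices(layer_ranges):
    """Flatten [start, end) ranges into the ordered list of source layers."""
    order = []
    for start, end in layer_ranges:
        order.extend(range(start, end))
    return order


layers = expand_slices([(0, 10), (5, 15), (10, 20)])
print(layers)
# 30 output layers are built from only 20 distinct source layers.
print(len(layers), len(set(layers)))
```

The repeated indices are why the output file grows: each repeated layer is written again unless the format can point two layers at the same tensor data.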

Personally, I would be interested in a hybrid approach, with the ability to merge and layer!
i.e. we want this particular output from 2 models (for one model, we could just use it again as the second model), which we'll call 'a' and 'b' for brevity. We want to use a mixture of interleaving and layer merging to get this final output. In this case, the first 3 layers are from model a, the fourth is a mix of models a and b, and the next few layers repeat layers from model b:
[a0, a1, a2, a3*0.5+b3*0.5, b4, b5, b6, b5, b6, b7]

Trying to stay with your parameter notation, the closest I could get for the 2 configs would be:
model_a 0-2*1,3*0.5,0-5*0 model_b 0-2*0,3*0.5,4-6*1,5-7*1

As both configs must be the same length, for model_a we used 0-5*0 as filler at the end.
Does that make sense?

@ngxson
Collaborator Author

ngxson commented Feb 29, 2024

Thanks for the explanation.

> This is why Frankenmerge models are larger than base models.

According to discussion #4718, the gguf format may benefit from pointing 2 weights in the metadata to the same tensor; this way we can have 2 or more layers using the same weights. I haven't tried this, but it's surely essential if we want to have repeated layers.

> Personally, I would be interested in a hybrid approach, with the ability to merge and layer!
>
> Trying to stay with your parameter notation, the closest I could get for the 2 configs would be: model_a 0-2*1,3*0.5,0-5*0 model_b 0-2*0,3*0.5,4-6*1,5-7*1

Having both merge + repeated layers is great. But for that, I think the whole notation that I invented (0-2*1,3*0.5,0-5*0) is just far too limited. I propose a more readable syntax (written to a file) like:

a0*1 + b0*0
a0*1 + b0*0
a1*0 + b1*1

The file above results in an output model having:

  • Layer 0: Model A layer 0
  • Layer 1: Model A layer 0
  • Layer 2: Model B layer 1
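
To make the proposed line syntax concrete, here is a minimal Python sketch of how one such line could be parsed. It is an illustration only: `parse_layer_line` is a hypothetical name, and the grammar (single-letter model name, layer index, `*`, scale, terms joined by `+`) is inferred from the three example lines above.

```python
import re

# One term looks like 'a0*1' or 'b3*0.5': model letter, layer index, scale.
TERM = re.compile(r"^([a-z])(\d+)\*(\d+(?:\.\d+)?)$")


def parse_layer_line(line):
    """Parse 'a0*1 + b0*0' into [(model, layer, scale), ...]."""
    terms = []
    for chunk in line.split("+"):
        match = TERM.match(chunk.strip())
        if match is None:
            raise ValueError(f"cannot parse term: {chunk!r}")
        model, layer, scale = match.groups()
        terms.append((model, int(layer), float(scale)))
    return terms


config = """a0*1 + b0*0
a0*1 + b0*0
a1*0 + b1*1"""

# Each non-empty line defines one output layer, in order.
for index, line in enumerate(config.splitlines()):
    print(index, parse_layer_line(line))
```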

It's not as robust as the lazy merge kit syntax (yml), but it gives us more room to improve in the future.

Additionally, someone can easily write a Python script to convert lazy merge kit yml to my syntax.

What do you think about this approach?

@dnhkng

dnhkng commented Feb 29, 2024

Sure, I think we should do it. I was about to start testing Mergekit now, but I can quickly switch gears and write a Python converter script.

> According to discussion #4718, the gguf format may benefit from pointing 2 weights in the metadata to the same tensor; this way we can have 2 or more layers using the same weights. I haven't tried this, but it's surely essential if we want to have repeated layers.

Yes, that would be a better method. I have a large model I know quite well that I've merged manually in ExllamaV2. It took a bit to sort out KV caching though, and there are issues when the model spans multiple GPUs. At first, I would just duplicate.

If you can generate the merging code, I can compare the results of your method to the measured result of my merge.

Update: I could write the Python converter, but now that I look in more detail, I think the layer-by-layer method here is much more powerful. Mergekit only allows either slice interleaving OR linear/spherical interpolation of all layers. The config model you describe is more verbose, but much more powerful. I would prefer that TBH.

TBH, there are two options: 1) easy parsing with just 3 values:

model-a layer, model-b layer, weight of model-a
0,0,1
0,0,1
1,1,0
2,2,0.5

Or 2) YAML, giving all the details:

sources:
  - model-a: 152334H/miqu-1-70b-sf
  - model-b: 152334H/other-model-b-70b-sf
  - model-c: 152334H/other-model-c-70b-sf      # we can then add as many models as we want
layers:
  - 1:
    model-a:
       layer_source: 1
       weight: 0.5
    model-b:
       layer_source: 1
       weight: 0.5
    method: linear               # and offer various interpolation methods
  - 2:
    model-a:
       layer_source: 2
       weight: 0.0
    model-b:
       layer_source: 2
       weight: 1.0
    method: linear
  - 3:
    model-a:
       layer_source: 3
       weight: 0.3
    model-b:
       layer_source: 5
       weight: 0.3
    model-c:
       layer_source: 5
       weight: 0.4
    method: slerp
  - 4:
    model-a:
       layer_source: 4
       weight: 1.0
    method: none               # and do straight passthrough of a single layer if needed

@ngxson
Collaborator Author

ngxson commented Feb 29, 2024

Thanks for the input, I'll need to rework this PR in the next few days.

Regarding the format, I still think having the ability to specify the weights of a and b separately can be interesting. I don't know what will happen if we take weightA*0.5 + weightB*0.6 for example (so the total weight becomes 1.1). It's also useful when you merge 3 models: the first pass can have weightA*0.33 + weightB*0.33, then the second pass adds weightC*0.33.

The CSV format should simplify the C++ parser code though; I'll consider that.

YML format is readable, but unfortunately we can never include a yml parser in llama.cpp.

However, having it as the input of your Python script (with the script converting that yml into CSV or something llama.cpp can understand) would be very useful.

@dnhkng

dnhkng commented Feb 29, 2024

Yes, the YAML could be converted to CSV easily, if we leave out various interpolation types.

For completeness, I would explicitly put in all weights, and normalise them to sum to 1.0,
i.e. for two models:

model-a layer, model-b layer, weight of model-a, weight of model-b
0,0,1.0,0.0
0,0,1.0,0.0
1,1,0.0,1.0
2,2,0.5,0.5

and for three models:

model-a layer, model-b layer, model-c layer, weight of model-a, weight of model-b, weight of model-c
0,0,0,1.0,0.0,0.0
0,0,0,1.0,0.0,0.0
1,1,1,0.0,1.0,0.0
2,2,2,0.5,0.5,0.0
3,3,3,0.3,0.3,0.3

The last layer here gets normalised to 1/3, 1/3, 1/3.
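
The normalisation step can be sketched as follows. This is an illustrative Python snippet, not part of the PR: `normalise` and `parse_row` are hypothetical names, and the row layout assumed is the "layers first, then weights" order shown above.

```python
def normalise(weights):
    """Scale the weights so that they sum to 1.0."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [w / total for w in weights]


def parse_row(row, n_models):
    """Split 'l_a,l_b,...,w_a,w_b,...' into (layers, normalised weights)."""
    fields = [field.strip() for field in row.split(",")]
    if len(fields) != 2 * n_models:
        raise ValueError(f"expected {2 * n_models} fields, got {len(fields)}")
    layers = [int(field) for field in fields[:n_models]]
    weights = normalise([float(field) for field in fields[n_models:]])
    return layers, weights


# The last three-model row: 0.3, 0.3, 0.3 becomes roughly 1/3 each.
print(parse_row("3,3,3,0.3,0.3,0.3", 3))
```

A row whose weights are all zero cannot be normalised, so the sketch rejects it instead of dividing by zero.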

@ngxson
Collaborator Author

ngxson commented Mar 1, 2024

@dnhkng I updated my PR to have the ability to:

  • Merge multiple models at once (not just 2 models)
  • Use the CSV format that we discussed

To simplify my CSV parsing code, I chose the column order "model - scale - model - scale" (instead of "model - model - scale - scale"):

0,1.0,0,0.0    meaning: output layer 0 = A[0]*1.0 + B[0] * 0.0
0,1.0,0,0.0    meaning: output layer 1 = A[0]*1.0 + B[0] * 0.0
1,0.0,2,0.0    meaning: output layer 2 = A[1]*0.0 + B[2] * 0.0
2,0.5,1,0.5    meaning: output layer 3 = A[2]*0.5 + B[1] * 0.5

If you add the third model, the columns become "model - scale - model - scale - model - scale"
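
For readers who want to check the arithmetic, here is a toy Python sketch of the interleaved "layer, scale, layer, scale" row format and the linear merge it describes. It is not the PR's C++ implementation: the function names are hypothetical, and each "tensor" is just a flat list of floats.

```python
def parse_interleaved_row(row):
    """'2,0.5,1,0.5' -> [(2, 0.5), (1, 0.5)], one (layer, scale) per model."""
    fields = [field.strip() for field in row.split(",")]
    if len(fields) % 2 != 0:
        raise ValueError("each model needs a layer and a scale")
    return [(int(fields[i]), float(fields[i + 1]))
            for i in range(0, len(fields), 2)]


def merge_layer(models, row):
    """Compute sum(scale * model[layer]) for one output layer.

    models[m][layer] is a flat list of floats standing in for a tensor.
    """
    pairs = parse_interleaved_row(row)
    if len(pairs) != len(models):
        raise ValueError("row does not match the number of models")
    size = len(models[0][pairs[0][0]])
    out = [0.0] * size
    for model, (layer, scale) in zip(models, pairs):
        for i, value in enumerate(model[layer]):
            out[i] += scale * value
    return out


model_a = [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0]]
model_b = [[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]
# Output layer = A[2]*0.5 + B[1]*0.5
print(merge_layer([model_a, model_b], "2,0.5,1,0.5"))  # [12.0, 12.0]
```

Adding a third model only extends each row by one more "layer, scale" pair.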

I tried it myself and confirmed that the output model can be loaded and run inference without any problem. What I could not verify is whether the merging result (the semantic result) is good or not (in other words, whether it did A*scale + B*scale correctly). Can you verify this? Thank you!

@ngxson
Collaborator Author

ngxson commented Mar 1, 2024

FYI, I was also thinking of adding the ability to merge quantized models, but at this stage it's quite tricky: I must dequantize, do the calculations in float, then re-quantize. Currently I'm staying with a single-threaded model for simplicity, but the whole "dequant-requant" thing should be done with multi-threading, which is too tricky for now.

@dnhkng

dnhkng commented Mar 1, 2024

Could you add a branch for pass-through (no linear interpolation) of quantized models?

I have a use case for that right now!

i.e. a single quantized model, with repeating layers.

The issue is that, from my tests, model self-merging only starts to help from 34B models and up. At FP16, that's a huge amount of RAM required!

I have a model that is a positive outlier on a difficult LLM benchmark, so it should be relatively clear whether the merge worked. It's a 70B model, so I'll need to run the tests on an 80 GB GPU. Interpolating layers would be an added benefit in the future though!

I will pull your code and try on FP16 Llama7B now, but I know all outputs will be worse than the base model. However, I know regions of "really bad", and "slightly bad", so I can see if it is at least making sense.

@ngxson
Collaborator Author

ngxson commented Mar 1, 2024

I'll try quantized models later. At least, loading a q4_K model and then outputting it as f16 is not too complicated. Only the requant part is too tricky for me.

Also, just out of curiosity: if you merge the model and then use ./quantize to re-quantize it, does that work for you? This way it takes a lot of disk space, but you'll eventually get a model small enough to fit into RAM.

One thing I'll try to work on is the ability to reuse the same tensor for repeated layers. For now, if the output model has duplicated layers, the associated tensor data will be duplicated (not ideal).

@dnhkng

dnhkng commented Mar 1, 2024

Reusing layers makes sense, but the caching is tricky.

There's a discussion on my pull request for ExllamaV2 here: turboderp-org/exllamav2#275

@dnhkng

dnhkng commented Mar 1, 2024

> I'll try quantized models later. At least, loading a q4_K model and then outputting it as f16 is not too complicated. Only the requant part is too tricky for me.
>
> Also, just out of curiosity: if you merge the model and then use ./quantize to re-quantize it, does that work for you? This way it takes a lot of disk space, but you'll eventually get a model small enough to fit into RAM.
>
> One thing I'll try to work on is the ability to reuse the same tensor for repeated layers. For now, if the output model has duplicated layers, the associated tensor data will be duplicated (not ideal).

I can try Q4 -> FP16 and re-quantization. I'll keep watching this pull request, and test it when it's ready. Intermediate disk space is fine, I have a few TB of SSD free ;)

@ngxson
Collaborator Author

ngxson commented Mar 1, 2024

> Reusing layers makes sense, but the caching is tricky.

Personally, I don't think a shared cache among layers is technically possible. While the weights are the same, KV is calculated from the embeddings produced by the layers before it (correct me if I'm wrong).

For example, when you have 2 consecutive layers having the same weights W[0] == W[1], then KV[1] = W[1]*(W[0]*KV[0])

P.S. I was actually bad at math in high school / university. Nowadays, with all this machine learning stuff, I still imagine a "tensor" as a "Rubik's cube" in my head.

@dnhkng

dnhkng commented Mar 1, 2024

> Reusing layers makes sense, but the caching is tricky.
>
> Personally, I don't think a shared cache among layers is technically possible. While the weights are the same, KV is calculated from the embeddings produced by the layers before it (correct me if I'm wrong).
>
> For example, when you have 2 consecutive layers having the same weights W[0] == W[1], then KV[1] = W[1]*(W[0]*KV[0])

Yes, you can't share the cache; it would get overwritten when the higher layer is processed... But it still works! The results are worse, but that's not unexpected. The fact that it even slightly works is crazy though.

I have done quite a lot of testing on various permutations of layers, and most are worse, but there are a few interesting combinations. GGUF would be the best way to share them, as going via FP16 torch tensors, then merging, then converting to GGUF and finally quantizing seems like a lot of wasted effort! Better to experiment in ExllamaV2 dynamically, and build and distribute in GGUF.

@dnhkng

dnhkng commented Mar 2, 2024

Tested it with a self-merge today on F16, and it looks good!
Self-merged models with layer repeats that I know are bad are also bad with your code, and good models also look good. It passes the first subjective tests :)

I will fire up an evaluation pipeline over the weekend, and do more extensive testing.

Just to clarify:
Does it do interpolation too, with quants? That would be amazing!

Also, Mergekit offers Spherical linear interpolation (SLERP). This seems to offer better merges. (brief description here).

@ngxson
Collaborator Author

ngxson commented Mar 2, 2024

Thanks! Glad to know that it works in your test.

Only linear merging is supported for now. SLERP is interesting too and technically possible (because internally we dequantize all matrices to float). However, I think I'll do that at a later stage (or in another branch).

What's not clear to me though: SLERP works with vectors, but our model weights are matrices. How can SLERP be applied to a matrix? For example, will a 4x4 matrix be considered a vector of 16 dimensions, or 4 vectors of 4 dimensions each?
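
For reference, SLERP between two vectors at parameter t can be written with θ as the angle between them: slerp(t) = sin((1-t)θ)/sin(θ) · v0 + sin(tθ)/sin(θ) · v1. The Python sketch below is a generic illustration, not mergekit's or this PR's code. The matrix helper takes the "one flattened vector" interpretation purely as an assumption; the per-row interpretation would instead call `slerp` once per row.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two equal-length vectors."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    lerp = [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    if norm0 < eps or norm1 < eps:
        return lerp  # direction undefined: fall back to linear
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    theta = math.acos(max(-1.0, min(1.0, dot)))
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        return lerp  # (anti)parallel vectors: fall back to linear
    s0 = math.sin((1 - t) * theta) / sin_theta
    s1 = math.sin(t * theta) / sin_theta
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


def slerp_matrix_flat(t, m0, m1):
    """Assumption: treat the whole matrix as one flattened vector."""
    cols = len(m0[0])
    flat = slerp(t, [x for row in m0 for x in row],
                 [x for row in m1 for x in row])
    return [flat[i:i + cols] for i in range(0, len(flat), cols)]


print(slerp(0.5, [1.0, 0.0], [0.0, 1.0]))
```

For the two orthogonal unit vectors in the example, the midpoint keeps unit length, which plain linear interpolation would not.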

@dnhkng

dnhkng commented Mar 2, 2024

> What's not clear to me though: SLERP works with vectors, but our model weights are matrices. How can SLERP be applied to a matrix? For example, will a 4x4 matrix be considered a vector of 16 dimensions, or 4 vectors of 4 dimensions each?

In PyTorch it seems straightforward. The implementation is here, from line 94:
https://github.com/arcee-ai/mergekit/blob/main/mergekit/merge_methods/slerp.py

I have just bought some cloud compute to test the merged model; I need 80 GB of VRAM for it to run at a useful speed. It will take a few hours at least.

@ngxson
Collaborator Author

ngxson commented Mar 2, 2024

Oh OK, thanks for the info. It seems that in the Python code, there is no place where the tensor view is changed to 1D. That means it keeps one row of the matrix == one vector.

I can wait, don't worry. I'm trying to refactor the re-quantization part in another PR, so we should get better performance when producing a quantized model as output.

@dnhkng

dnhkng commented Mar 2, 2024

> I can wait, don't worry. I'm trying to refactor the re-quantization part in another PR, so we should get better performance when producing a quantized model as output.

Great! I'm merging a 70B model, and it's not super fast. Many layers have a 1.0/0.0 weight ratio. Maybe as a backlog item: if a new layer takes 100% of its weight from one model, skip dequantization, merging and re-quantization, and just pass that layer through. Not urgent though. It looks like the merge will take about 30 minutes.
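
The suggested shortcut amounts to a small check on the per-model scales of each output layer. A hedged Python sketch follows; `plan_layer` and its return values are hypothetical and only illustrate the decision, not how the PR would implement it.

```python
def plan_layer(scales, tol=1e-9):
    """Decide whether a layer can be copied as-is or must be merged.

    Returns ('copy', model_index) when exactly one model has scale 1.0 and
    all others are 0.0, so dequantize/merge/requantize can be skipped.
    Otherwise returns ('merge', None).
    """
    nonzero = [i for i, s in enumerate(scales) if abs(s) > tol]
    if len(nonzero) == 1 and abs(scales[nonzero[0]] - 1.0) <= tol:
        return ("copy", nonzero[0])
    return ("merge", None)


print(plan_layer([1.0, 0.0]))  # ('copy', 0)
print(plan_layer([0.5, 0.5]))  # ('merge', None)
```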

@ngxson
Collaborator Author

ngxson commented Mar 2, 2024

FYI, I've just pushed a refactor commit that makes better use of multi-threading for the re-quant operation (using the same code as the ./quantize tool). You'll now be able to use almost 100% of the CPU for re-quantization.

@ngxson
Collaborator Author

ngxson commented Mar 2, 2024

I had a look at mergekit + slerp today. I think I can add slerp in this PR, as it makes more sense than the linear method. However, I will need to reinvent my input format.

In the blog article, they specifically target some tensors, for example self_attn or mlp:

parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5

The current CSV format does not allow specifying scaling at the tensor level. Therefore, I propose a new format inspired by assembly language:

---
all slerp 0,0,0.9
attn_output slerp 0,0,0.9
---
all linear 1,1,0.6,0.4
attn_output slerp 1,1,0.9
---
all linear 2,2,1.0,0.0
---
# repeat the first layer defined earlier in this file
repeat 0
---
repeat 1
---
...
  • Each --- means a new output layer
  • Then, each instruction is in the format tensor (space) verb (space) arguments. Verbs that we can have now:
    • linear with arguments in order of source_layer,source_layer,scale,scale
    • slerp with arguments in order of source_layer,source_layer,t
    • repeat to repeat a layer in the same output model
    • Other methods like ties or dare can be added in the future. I also thought about copy, which simply copies the layer from one of the source models to the output model
  • For simplicity, we will only allow merging 2 models for now (no more than 2)

I don't know if it's too complicated for your converter script, @dnhkng?

@dnhkng

dnhkng commented Mar 2, 2024

OK, the 70B Model merge looks interesting.

The merges go in the same direction I see with ExllamaV2, so I think everything is working OK!

I have one small issue that I'm still trying to figure out, though. I use EQ-Bench to test the models, and weirdly, using the llama.cpp server I get significantly worse results than using exllama via oobabooga. The relative changes are all correct, but the absolute scores for the llama.cpp backend are about 75.5, using the original leaked Miqu Q4/5 weights. However, I get a score of 82.7 for the Q4 weights with exllamaV2! A 7-point difference here is massive.

This is extra weird, as the exllama weights are just the Miqu weights that have been de-quantized, converted and re-quantized, so you would not expect them to be so much better (I would expect them to be slightly worse). I've made an issue at the benchmark repo, but maybe someone here might know why this is the case.

@ngxson

> I don't know if it's too complicated for your converter script, @dnhkng?

All good, write a config style you like, and I'll write up a python converter :)

So long as the format is sensible, it should be easy to generate a High-Level abstraction. The fallback is to write low-level by hand, for unusual cases. The combination is powerful.

@ngxson
Collaborator Author

ngxson commented Mar 3, 2024

For the benchmark difference between the llama.cpp server and exllama, apart from the chat template that I discussed in the other issue, maybe it's also because the KV cache of llama.cpp is f16 by default. (I don't know if exllama uses f16, bf16 or f32 for KV; note that even when the model is quantized, the KV cache may not be.)

I'll start working on slerp and the new input format today, as the current implementation already outputs a usable result.

@ngxson
Collaborator Author

ngxson commented Mar 3, 2024

@dnhkng I added the new format and SLERP. It's slightly different from what I proposed above, but will be easier to understand:

output layer 0
all slerp 0,0,0.1
attn_output slerp 0,0,0.5

output layer 1
all linear 1,1,0.6,0.4
attn_output slerp 1,1,0.9

output layer 2
all copy 0,2

...

You can have a look at config.example.txt for a complete example.
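
As a rough guide for anyone writing a generator or converter for this format, here is a Python sketch that parses the snippet shown above. It is based only on the lines visible in this comment, not on config.example.txt: the function name `parse_merge_config` is hypothetical, treating `#` as a comment marker is an assumption, and the arguments are kept as raw numbers without interpreting them.

```python
def parse_merge_config(text):
    """Return {output_layer: [(tensor, verb, [args...]), ...]}."""
    layers = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line or line == "...":
            continue
        if line.startswith("output layer"):
            current = int(line.split()[2])
            layers[current] = []
            continue
        if current is None:
            raise ValueError(f"instruction before any output layer: {line!r}")
        # Each instruction is 'tensor verb arg,arg,...'.
        tensor, verb, arg_text = line.split(None, 2)
        args = [float(arg) for arg in arg_text.split(",")]
        layers[current].append((tensor, verb, args))
    return layers


example = """output layer 0
all slerp 0,0,0.1
attn_output slerp 0,0,0.5

output layer 1
all linear 1,1,0.6,0.4
attn_output slerp 1,1,0.9

output layer 2
all copy 0,2
"""

for layer, instructions in parse_merge_config(example).items():
    print(layer, instructions)
```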

I've tried merging dolphin-mistral with vistral (mistral finetuned to understand Vietnamese). The output model does speak mixed English-Vietnamese, which indicates that my code kind of works. The merge config used is config.example.txt.

Feel free to ask if something is not clear. Thank you!

@dnhkng

dnhkng commented Mar 3, 2024

I'll try to write a high-level Python configuration generator for the new format.

@jukofyork
Contributor

jukofyork commented Mar 5, 2024

If anyone is interested, then I think we should in theory be able to get a better estimate of the original fp16 values for the Miqu model by combining the q_5, q_4 and q_2 quantized values.

I don't really know what criteria llama.cpp is using to quantize the values, but I assume it's to minimise the least squares error? If so then I think we can assume the values come from a normal distribution and then work out the correct weighting factor for the 3 different bin centres we have for every original fp16 value that was quantized.

This obviously won't work for some distributional assumptions, e.g. if the original fp16 values came from a uniform distribution, then knowing which of the 4 bins the q_2 came from and which of the 16 bins the q_4 came from gives us no extra information over knowing which of the 32 bins the q_5 came from, and the maximum likelihood estimate is still just the centre of the q_5 bin (assuming the bin boundaries all align anyway).

But I think the values are pretty likely to have come from an approximately normal distribution (especially due to all the layer norms in the model, etc.) and the correct weighting factors should be findable either analytically or empirically.

Without explicitly working it out, I think the weights will likely be something like the #bins ratio squared (ie: using the conjugate prior formula), but I'm pretty sure it could be worked out empirically quite easily if we know the exact criteria the quantization is using.

It probably won't be a huge increase and at best be around the level of q_6, but it would likely be useful for those remerging the de-quantized fp16 model off huggingface.

@jukofyork
Contributor

jukofyork commented Mar 5, 2024

Yeah, to work it out analytically looks quite hard:

https://openaccess.thecvf.com/content_CVPRW_2020/papers/w40/Pouransari_Least_Squares_Binary_Quantization_of_Neural_Networks_CVPRW_2020_paper.pdf

but it wouldn't be hard to estimate the weights empirically as we could just simulate the forward quantization process used to create a q_5, q_4, and q_2 of a standard normal (using least squares criteria or whatever llama.cpp is using) and then find the optimal weighting factors to get the maximum likelihood estimate of the original fp16 value (or something similar anyway).

It may turn out to be a different weighting factor for each of the 32×16×4 combinations, but even this wouldn't be hard to find empirically via simulation.
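
A toy version of that simulation could look like the Python sketch below. It is only a harness under strong simplifying assumptions: it uses a plain uniform quantizer over a fixed range with aligned bins, not llama.cpp's actual block-wise quantization, it combines just a 5-bit and a 4-bit estimate with a single global weight, and all names are hypothetical. The result of this toy setup should not be read as a prediction for real q_5/q_4 files.

```python
import random


def quantize(x, bins, lo=-4.0, hi=4.0):
    """Uniform quantizer: clamp x into [lo, hi] and return its bin centre."""
    width = (hi - lo) / bins
    index = min(bins - 1, max(0, int((x - lo) / width)))
    return lo + (index + 0.5) * width


def mse(estimates, truth):
    return sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth)


def best_weight(n_samples=5000, seed=0, steps=100):
    """Grid-search w in [0, 1] for the estimate w*q5 + (1-w)*q4."""
    rng = random.Random(seed)
    truth = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    q5 = [quantize(x, 32) for x in truth]
    q4 = [quantize(x, 16) for x in truth]
    baseline = mse(q5, truth)  # error when using the 5-bit value alone
    best_w, best_err = 1.0, baseline
    for step in range(steps + 1):
        w = step / steps
        combined = [w * a + (1 - w) * b for a, b in zip(q5, q4)]
        err = mse(combined, truth)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err, baseline


w, err, baseline = best_weight()
print(f"best weight on q5: {w:.2f}, mse: {err:.6f}, q5-only mse: {baseline:.6f}")
```

Swapping `quantize` for a faithful model of the real quantization scheme, and fitting a weight per bin combination rather than one global weight, would be the natural next steps for the approach described above.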

@ngxson
Collaborator Author

ngxson commented Mar 5, 2024

> This obviously won't work for some distributional assumptions, e.g. if the original fp16 values came from a uniform distribution, then knowing which of the 4 bins the q_2 came from and which of the 16 bins the q_4 came from gives us no extra information over knowing which of the 32 bins the q_5 came from, and the maximum likelihood estimate is still just the centre of the q_5 bin (assuming the bin boundaries all align anyway).

I agree with that: since we're using qX_K and not qX_0 or qX_1, the difference between 16 bins of q4 and 32 bins of q5 is not that much. Throwing q2 into the equation may make it worse. I assume that dequantizing q5 is already the best result we can get.

@ngxson
Collaborator Author

ngxson commented Mar 5, 2024

Btw @dnhkng I came across mergekit's code for merging the embedding & output layers; it seems like it's also an important part of improving the quality of the output model. I'll try to implement that this week, but it's quite tricky because sometimes we have models with different vocab sizes (i.e. added special tokens).

@dnhkng

dnhkng commented Mar 5, 2024

> Btw @dnhkng I came across mergekit's code for merging the embedding & output layers; it seems like it's also an important part of improving the quality of the output model. I'll try to implement that this week, but it's quite tricky because sometimes we have models with different vocab sizes (i.e. added special tokens).

Will that mean a new format for the configuration?

@ngxson
Collaborator Author

ngxson commented Mar 5, 2024

> Will that mean a new format for the configuration?

No, don't worry, it will just be an additional (optional) command to add to the current format.

@dnhkng

dnhkng commented Mar 9, 2024

OK, I've written a YAML parser that converts high-level config files to your format, including some quite complex merges.

ngxson#3

@ngxson
Collaborator Author

ngxson commented Mar 9, 2024

@dnhkng Thank you! Seems good, I'll try it tomorrow

@sorasoras

> @dnhkng Thank you! Seems good, I'll try it tomorrow

Haven't seen any progress. Any update?

@ngxson
Collaborator Author

ngxson commented Apr 18, 2024

Yeah, sorry, I've been quite busy since then. The Python converter script looks good, but merging this PR (the part that I made) into master is quite risky, since it's quite large and I doubt anyone will find it helpful in the future.

For now, I think we can consider this PR a demo. But feel free to let me know if you want to change something else.

@ngxson ngxson added the demo Demonstrate some concept or idea, not intended to be merged label Apr 18, 2024
Labels
demo Demonstrate some concept or idea, not intended to be merged help wanted Extra attention is needed
4 participants